Optimistic Concurrency Control in a Distributed NameNode Architecture for Hadoop Distributed File System
نویسنده
چکیده
The Hadoop Distributed File System (HDFS) is the storage layer for Apache Hadoop ecosystem, persisting large data sets across multiple machines. However, the overall storage capacity is limited since the metadata is stored in-memory on a single server, called the NameNode. The heap size of the NameNode restricts the number of data files and addressable blocks persisted in the file system. The Hadoop Open Platform-as-a-service (Hop) is an open platform-as-a-Service (PaaS) support of the Hadoop ecosystem on existing cloud platforms including Amazon Web Service and OpenStack. The storage layer of Hop, called the Hop-HDFS, is a highly available implementation of HDFS, based on storing the metadata in a distributed, in-memory, replicated database, called the MySQL Cluster. It aims to overcome the NameNode’s limitation while maintaining the strong consistency semantics of HDFS so that applications written for HDFS can run on Hop-HDFS without modifications. Precedent thesis works have contributed for a transaction model for Hop-HDFS. From system-level coarse grained locking to row-level fine grained locking, the strong consistency semantics have been ensured in Hop-HDFS, but the overall performance is restricted compared to the original HDFS. In this thesis, we first analyze the limitation in HDFS NameNode implementation and provide an overview of Hop-HDFS illustrating how we overcome those problems. Then we give a systematic assessment on precedent works for Hop-HDFS comparing to HDFS, and also analyze the restriction when using pessimistic locking mechanisms to ensure the strong consistency semantics. Finally, based on the investigation of current shortcomings, we provide a solution for Hop-HDFS based on optimistic concurrency control with snapshot isolation on semantic related group to improve the operation throughput while maintaining the strong consistency semantics in HDFS. The evaluation shows the significant improvement of this new model. The correctness of our implementation has been validated by 300+ Apache HDFS unit tests passing.
منابع مشابه
High Scalability of HDFS using Distributed Namespace
In data intensive computing, Hadoop is widely used by organizations. The client applications of Hadoop require high availability and scalability of the system. Mostly, these applications are online and their data growth rate is unpredictable. The present Hadoop relies on secondary namenode for failover which slows down the performance of the system. Hadoop system’s scalability depends on the ve...
متن کاملSustainability of Hadoop Clusters
Hadoop is a set of utilities and frameworks for the development and storage of distributed applications in cloud computing, the core component of which is the Hadoop Distributed File System (HDFS). NameNode is a key element of its architecture, and also its “single point of failure”. To address this issue, we propose a replication mechanism that will protect the NameNode data in case of failure...
متن کاملNameNode and DataNode Coupling for a Power-Proportional Hadoop Distributed File System
Current works on power-proportional distributed file systems have not considered the cost of updating data sets that were modified (updated or appended) in a low-power mode, where a subset of nodes were powered off. Effectively reflecting the updated data is vital in making a distributed file system, such as the Hadoop Distributed File System (HDFS), power proportional. This paper presents a no...
متن کاملThe Recovery System for Hadoop Cluster
Due to brisk growth of data volume in many organizations, large-scale data processing became a demanding topic for industry as well as for academic fields. Hadoop is widely adopted in Cloud Computing environment for unstructured data. Hadoop is an open source, a java based distributed computing framework, and supports large-scale distributed data processing. In the recent years, Hadoop Distribu...
متن کاملScaling HDFS with a Strongly Consistent Relational Model for Metadata
The Hadoop Distributed File System (HDFS) scales to store tens of petabytes of data despite the fact that the entire le system's metadata must t on the heap of a single Java virtual machine. The size of HDFS' metadata is limited to under 100 GB in production, as garbage collection events in bigger clusters result in heartbeats timing out to the metadata server (NameNode). In this paper, we addr...
متن کامل